Dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service

ABSTRACT

In one embodiment, a network assurance service that monitors a network detects, using a machine learning-based anomaly detector, network anomalies associated with source nodes in the monitored network. The network assurance service identifies, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node. The network assurance service correlates networking devices along the network paths in the identified sets of network paths with the detected network anomalies. The network assurance service adjusts the machine learning-based anomaly detector to use a performance measurement for a particular one of the networking devices as an input feature, based on the correlation between the particular networking device and the detected network anomalies.

TECHNICAL FIELD

The present disclosure relates generally to computer networks, and, more particularly, to the dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service.

BACKGROUND

Networks are large-scale distributed systems governed by complex dynamics and a very large number of parameters. In general, network assurance involves applying analytics to captured network information, to assess the health of the network. For example, a network assurance system may track and assess metrics such as available bandwidth, packet loss, jitter, and the like, to ensure that the experiences of users of the network are not impinged. However, as networks continue to evolve, so too will the number of applications present in a given network, as well as the number of metrics available from the network.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments herein may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identically or functionally similar elements, of which:

FIGS. 1A-1B illustrate an example communication network;

FIG. 2 illustrates an example network device/node;

FIG. 3 illustrates an example network assurance system;

FIG. 4 illustrates an example architecture for a network assurance service;

FIG. 5 illustrates an example dependency graph for a network;

FIG. 6 illustrates another example dependency graph for a network;

FIG. 7 illustrates an example plot of network throughput; and

FIG. 8 illustrates an example simplified procedure for dynamic inspection of networking dependencies to enhance anomaly detection models.

DESCRIPTION OF EXAMPLE EMBODIMENTS

Overview

According to one or more embodiments of the disclosure, a network assurance service that monitors a network detects, using a machine learning-based anomaly detector, network anomalies associated with source nodes in the monitored network. The network assurance service identifies, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node. The network assurance service correlates networking devices along the network paths in the identified sets of network paths with the detected network anomalies. The network assurance service adjusts the machine learning-based anomaly detector to use a performance measurement for a particular one of the networking devices as an input feature, based on the correlation between the particular networking device and the detected network anomalies.

DESCRIPTION

A computer network is a geographically distributed collection of nodes interconnected by communication links and segments for transporting data between end nodes, such as personal computers and workstations, or other devices, such as sensors, etc. Many types of networks are available, with the types ranging from local area networks (LANs) to wide area networks (WANs). LANs typically connect the nodes over dedicated private communications links located in the same general physical location, such as a building or campus. WANs, on the other hand, typically connect geographically dispersed nodes over long-distance communications links, such as common carrier telephone lines, optical lightpaths, synchronous optical networks (SONET), or synchronous digital hierarchy (SDH) links, or Powerline Communications (PLC) such as IEEE 61334, IEEE P1901.2, and others. The Internet is an example of a WAN that connects disparate networks throughout the world, providing global communication between nodes on various networks. The nodes typically communicate over the network by exchanging discrete frames or packets of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP). In this context, a protocol consists of a set of rules defining how the nodes interact with each other. Computer networks may be further interconnected by an intermediate network node, such as a router, to extend the effective “size” of each network.

Smart object networks, such as sensor networks, in particular, are a specific type of network having spatially distributed autonomous devices such as sensors, actuators, etc., that cooperatively monitor physical or environmental conditions at different locations, such as, e.g., energy/power consumption, resource consumption (e.g., water/gas/etc. for advanced metering infrastructure or “AMI” applications), temperature, pressure, vibration, sound, radiation, motion, pollutants, etc. Other types of smart objects include actuators, e.g., responsible for turning on/off an engine or performing any other actions. Sensor networks, a type of smart object network, are typically shared-media networks, such as wireless or PLC networks. That is, in addition to one or more sensors, each sensor device (node) in a sensor network may generally be equipped with a radio transceiver or other communication port such as PLC, a microcontroller, and an energy source, such as a battery. Often, smart object networks are considered field area networks (FANs), neighborhood area networks (NANs), personal area networks (PANs), etc. Generally, size and cost constraints on smart object nodes (e.g., sensors) result in corresponding constraints on resources such as energy, memory, computational speed and bandwidth.

FIG. 1A is a schematic block diagram of an example computer network 100 illustratively comprising nodes/devices, such as a plurality of routers/devices interconnected by links or networks, as shown. For example, customer edge (CE) routers 110 may be interconnected with provider edge (PE) routers 120 (e.g., PE-1, PE-2, and PE-3) in order to communicate across a core network, such as an illustrative network backbone 130. For example, routers 110, 120 may be interconnected by the public Internet, a multiprotocol label switching (MPLS) virtual private network (VPN), or the like. Data packets 140 (e.g., traffic/messages) may be exchanged among the nodes/devices of the computer network 100 over links using predefined network communication protocols such as the Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Asynchronous Transfer Mode (ATM) protocol, Frame Relay protocol, or any other suitable protocol. Those skilled in the art will understand that any number of nodes, devices, links, etc. may be used in the computer network, and that the view shown herein is for simplicity.

In some implementations, a router or a set of routers may be connected to a private network (e.g., dedicated leased lines, an optical network, etc.) or a virtual private network (VPN), such as an MPLS VPN provided by a carrier network, via one or more links exhibiting very different network and service level agreement characteristics. For the sake of illustration, a given customer site may fall under any of the following categories:

1.) Site Type A: a site connected to the network (e.g., via a private or VPN link) using a single CE router and a single link, with potentially a backup link (e.g., a 3G/4G/LTE backup connection). For example, a particular CE router 110 shown in network 100 may support a given customer site, potentially also with a backup link, such as a wireless connection.

2.) Site Type B: a site connected to the network using two MPLS VPN links (e.g., from different Service Providers), with potentially a backup link (e.g., a 3G/4G/LTE connection). A site of type B may itself be of different types:

2a.) Site Type B1: a site connected to the network using two MPLS VPN links (e.g., from different Service Providers), with potentially a backup link (e.g., a 3G/4G/LTE connection).

2b.) Site Type B2: a site connected to the network using one MPLS VPN link and one link connected to the public Internet, with potentially a backup link (e.g., a 3G/4G/LTE connection). For example, a particular customer site may be connected to network 100 via PE-3 and via a separate Internet connection, potentially also with a wireless backup link.

2c.) Site Type B3: a site connected to the network using two links connected to the public Internet, with potentially a backup link (e.g., a 3G/4G/LTE connection).

Notably, MPLS VPN links are usually tied to a committed service level agreement, whereas Internet links may either have no service level agreement at all or a loose service level agreement (e.g., a “Gold Package” Internet service connection that guarantees a certain level of performance to a customer site).

3.) Site Type C: a site of type B (e.g., types B1, B2 or B3) but with more than one CE router (e.g., a first CE router connected to one link while a second CE router is connected to the other link), and potentially a backup link (e.g., a wireless 3G/4G/LTE backup link). For example, a particular customer site may include a first CE router 110 connected to PE-2 and a second CE router 110 connected to PE-3.

FIG. 1B illustrates an example of network 100 in greater detail, according to various embodiments. As shown, network backbone 130 may provide connectivity between devices located in different geographical areas and/or different types of local networks. For example, network 100 may comprise local/branch networks 160, 162 that include devices/nodes 10-16 and devices/nodes 18-20, respectively, as well as a data center/cloud environment 150 that includes servers 152-154. Notably, local networks 160-162 and data center/cloud environment 150 may be located in different geographic locations.

Servers 152-154 may include, in various embodiments, a network management server (NMS), a dynamic host configuration protocol (DHCP) server, a constrained application protocol (CoAP) server, an outage management system (OMS), an application policy infrastructure controller (APIC), an authentication, authorization and accounting (AAA) server, an application server, etc. As would be appreciated, network 100 may include any number of local networks, data centers, cloud environments, devices/nodes, servers, etc.

In some embodiments, the techniques herein may be applied to other network topologies and configurations. For example, the techniques herein may be applied to peering points with high-speed links, data centers, etc.

In various embodiments, network 100 may include one or more mesh networks, such as an Internet of Things network. Loosely, the term “Internet of Things” or “IoT” refers to uniquely identifiable objects (things) and their virtual representations in a network-based architecture. In particular, the next frontier in the evolution of the Internet is the ability to connect more than just computers and communications devices, but rather the ability to connect “objects” in general, such as lights, appliances, vehicles, heating, ventilating, and air-conditioning (HVAC), windows and window shades and blinds, doors, locks, etc. The “Internet of Things” thus generally refers to the interconnection of objects (e.g., smart objects), such as sensors and actuators, over a computer network (e.g., via IP), which may be the public Internet or a private network.

Notably, shared-media mesh networks, such as wireless or PLC networks, etc., are often on what is referred to as Low-Power and Lossy Networks (LLNs), which are a class of network in which both the routers and their interconnect are constrained: LLN routers typically operate with constraints, e.g., processing power, memory, and/or energy (battery), and their interconnects are characterized by, illustratively, high loss rates, low data rates, and/or instability. LLNs are comprised of anything from a few dozen to thousands or even millions of LLN routers, and support point-to-point traffic (between devices inside the LLN), point-to-multipoint traffic (from a central control point such as the root node to a subset of devices inside the LLN), and multipoint-to-point traffic (from devices inside the LLN towards a central control point). Often, an IoT network is implemented with an LLN-like architecture. For example, as shown, local network 160 may be an LLN in which CE-2 operates as a root node for nodes/devices 10-16 in the local mesh, in some embodiments.

In contrast to traditional networks, LLNs face a number of communication challenges. First, LLNs communicate over a physical medium that is strongly affected by environmental conditions that change over time. Some examples include temporal changes in interference (e.g., other wireless networks or electrical appliances), physical obstructions (e.g., doors opening/closing, seasonal changes such as the foliage density of trees, etc.), and propagation characteristics of the physical media (e.g., temperature or humidity changes, etc.). The time scales of such temporal changes can range from milliseconds (e.g., transmissions from other transceivers) to months (e.g., seasonal changes of an outdoor environment). In addition, LLN devices typically use low-cost and low-power designs that limit the capabilities of their transceivers. In particular, LLN transceivers typically provide low throughput. Furthermore, LLN transceivers typically support limited link margin, making the effects of interference and environmental changes visible to link and network protocols. The high number of nodes in LLNs in comparison to traditional networks also makes routing, quality of service (QoS), security, network management, and traffic engineering extremely challenging, to mention a few.

FIG. 2 is a schematic block diagram of an example node/device 200 that may be used with one or more embodiments described herein, e.g., as any of the computing devices shown in FIGS. 1A-1B, particularly the PE routers 120, CE routers 110, nodes/devices 10-20, servers 152-154 (e.g., a network controller located in a data center, etc.), any other computing device that supports the operations of network 100 (e.g., switches, etc.), or any of the other devices referenced below. The device 200 may also be any other suitable type of device depending upon the type of network architecture in place, such as IoT nodes, etc. Device 200 comprises one or more network interfaces 210, one or more processors 220, and a memory 240 interconnected by a system bus 250, and is powered by a power supply 260.

The network interfaces 210 include the mechanical, electrical, and signaling circuitry for communicating data over physical links coupled to the network 100. The network interfaces may be configured to transmit and/or receive data using a variety of different communication protocols. Notably, a physical network interface 210 may also be used to implement one or more virtual network interfaces, such as for virtual private network (VPN) access, known to those skilled in the art.

The memory 240 comprises a plurality of storage locations that are addressable by the processor(s) 220 and the network interfaces 210 for storing software programs and data structures associated with the embodiments described herein. The processor 220 may comprise necessary elements or logic adapted to execute the software programs and manipulate the data structures 245. An operating system 242 (e.g., the Internetworking Operating System, or IOS®, of Cisco Systems, Inc., another operating system, etc.), portions of which are typically resident in memory 240 and executed by the processor(s), functionally organizes the node by, inter alia, invoking network operations in support of software processes and/or services executing on the device. These software processes and/or services may comprise a network assurance process 248, as described herein, any of which may alternatively be located within individual network interfaces.

It will be apparent to those skilled in the art that other processor and memory types, including various computer-readable media, may be used to store and execute program instructions pertaining to the techniques described herein. Also, while the description illustrates various processes, it is expressly contemplated that various processes may be embodied as modules configured to operate in accordance with the techniques herein (e.g., according to the functionality of a similar process). Further, while processes may be shown and/or described separately, those skilled in the art will appreciate that processes may be routines or modules within other processes.

Network assurance process 248 includes computer executable instructions that, when executed by processor(s) 220, cause device 200 to perform network assurance functions as part of a network assurance infrastructure within the network. In general, network assurance refers to the branch of networking concerned with ensuring that the network provides an acceptable level of quality in terms of the user experience. For example, in the case of a user participating in a videoconference, the infrastructure may enforce one or more network policies regarding the videoconference traffic, as well as monitor the state of the network, to ensure that the user does not perceive potential issues in the network (e.g., the video seen by the user freezes, the audio output drops, etc.).

In some embodiments, network assurance process 248 may use any number of predefined health status rules, to enforce policies and to monitor the health of the network, in view of the observed conditions of the network. For example, one rule may be related to maintaining the service usage peak on a weekly and/or daily basis and specify that if the monitored usage variable exceeds more than 10% of the per day peak from the current week AND more than 10% of the last four weekly peaks, an insight alert should be triggered and sent to a user interface.
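By way of illustration only, the peak-usage rule described above might be evaluated as in the following Python sketch. The function name, its parameters, and the reading of the rule as "usage exceeds each peak by more than 10%" are assumptions made for this example and are not taken from the embodiments herein.

```python
from typing import Sequence


def usage_peak_alert(usage: float,
                     daily_peaks_this_week: Sequence[float],
                     last_four_weekly_peaks: Sequence[float],
                     margin: float = 0.10) -> bool:
    """Return True when an insight alert should be raised.

    Assumed interpretation: the monitored usage must exceed, by more than
    `margin`, BOTH the highest per-day peak of the current week AND the
    highest of the last four weekly peaks.
    """
    if not daily_peaks_this_week or not last_four_weekly_peaks:
        return False  # not enough history to evaluate the rule
    daily_limit = max(daily_peaks_this_week) * (1.0 + margin)
    weekly_limit = max(last_four_weekly_peaks) * (1.0 + margin)
    return usage > daily_limit and usage > weekly_limit
```

Both conditions are combined with a logical AND, mirroring the rule text, so a usage spike that only exceeds the daily peaks does not raise an alert.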

Another example of a health status rule may involve client transition events in a wireless network. In such cases, whenever there is a failure in any of the transition events, the wireless controller may send a reason_code to the assurance system. To evaluate a rule regarding these conditions, the network assurance system may then group 150 failures into different “buckets” (e.g., Association, Authentication, Mobility, DHCP, WebAuth, Configuration, Infra, Delete, De-Authorization) and continue to increment these counters per service set identifier (SSID), while performing averaging every five minutes and hourly. The system may also maintain a client association request count per SSID every five minutes and hourly, as well. To trigger the rule, the system may evaluate whether the error count in any bucket has exceeded 20% of the total client association request count for one hour.
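The bucket-based rule above can be sketched, under stated assumptions, as follows. The sketch covers a single SSID over a single one-hour window; the function name and the representation of each failure as a bucket-name string are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List


def buckets_over_threshold(failure_buckets: Iterable[str],
                           association_requests: int,
                           threshold: float = 0.20) -> List[str]:
    """Return the failure buckets whose hourly error count exceeds
    `threshold` of the hourly client association request count.

    `failure_buckets` holds one bucket name (e.g., "DHCP") per failure
    reported for one SSID during the hour (an assumed input format).
    """
    if association_requests <= 0:
        return []  # no requests, so no meaningful ratio
    counts = Counter(failure_buckets)
    limit = threshold * association_requests
    return sorted(bucket for bucket, n in counts.items() if n > limit)
```

For example, with 100 association requests in the hour, 25 DHCP failures, and 10 Authentication failures, only the DHCP bucket crosses the 20% threshold and would trigger the rule.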

In various embodiments, network assurance process 248 may also utilize machine learning techniques, to enforce policies and to monitor the health of the network. In general, machine learning is concerned with the design and the development of techniques that take as input empirical data (such as network statistics and performance indicators), and recognize complex patterns in these data. One very common pattern among machine learning techniques is the use of an underlying model M, whose parameters are optimized for minimizing the cost function associated with M, given the input data. For instance, in the context of classification, the model M may be a straight line that separates the data into two classes (e.g., labels) such that M=a*x+b*y+c and the cost function would be the number of misclassified points. The learning process then operates by adjusting the parameters a,b,c such that the number of misclassified points is minimal. After this optimization phase (or learning phase), the model M can be used very easily to classify new data points. Often, M is a statistical model, and the cost function is inversely proportional to the likelihood of M, given the input data.
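The straight-line classifier described above can be illustrated with a minimal Python sketch. The exhaustive search over a small grid of candidate values for a, b, and c is a stand-in chosen purely for simplicity and is not the optimization method of any particular embodiment; the function names and the sample data are likewise assumptions.

```python
from itertools import product
from typing import Sequence, Tuple

Point = Tuple[float, float]
Params = Tuple[float, float, float]


def classify(params: Params, point: Point) -> int:
    """Label a point 1 if it lies on the positive side of a*x + b*y + c."""
    a, b, c = params
    x, y = point
    return 1 if a * x + b * y + c > 0 else 0


def cost(params: Params, points: Sequence[Point], labels: Sequence[int]) -> int:
    """Cost function: the number of misclassified points."""
    return sum(classify(params, p) != label for p, label in zip(points, labels))


def fit(points: Sequence[Point], labels: Sequence[int],
        grid: Sequence[float]) -> Params:
    """Pick the (a, b, c) on the grid with the fewest misclassifications."""
    return min(product(grid, repeat=3),
               key=lambda params: cost(params, points, labels))


# Toy data: two points per class, separable by a line.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
labels = [0, 0, 1, 1]
best = fit(points, labels, grid=[-4, -3, -2, -1, 0, 1, 2, 3, 4])
```

Once `best` has been selected, `classify(best, new_point)` labels a new data point, which corresponds to using the model M after the learning phase.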

In various embodiments, network assurance process 248 may employ one or more supervised, unsupervised, or semi-supervised machine learning models. Generally, supervised learning entails the use of a training set of data, as noted above, that is used to train the model to apply labels to the input data. For example, the training data may include sample network observations that do, or do not, violate a given network health status rule and are labeled as such. On the other end of the spectrum are unsupervised techniques that do not require a training set of labels. Notably, while a supervised learning model may look for previously seen patterns that have been labeled as such, an unsupervised model may instead look to whether there are sudden changes in the behavior. Semi-supervised learning models take a middle ground approach that uses a greatly reduced set of labeled training data.

Example machine learning techniques that network assurance process 248 can employ may include, but are not limited to, nearest neighbor (NN) techniques (e.g., k-NN models, replicator NN models, etc.), statistical techniques (e.g., Bayesian networks, etc.), clustering techniques (e.g., k-means, mean-shift, etc.), neural networks (e.g., reservoir networks, artificial neural networks, etc.), support vector machines (SVMs), logistic or other regression, Markov models or chains, principal component analysis (PCA) (e.g., for linear models), multi-layer perceptron (MLP) ANNs (e.g., for non-linear models), replicating reservoir networks (e.g., for non-linear models, typically for time series), random forest classification, or the like.

The performance of a machine learning model can be evaluated in a number of ways based on the number of true positives, false positives, true negatives, and/or false negatives of the model. For example, the false positives of the model may refer to the number of times the model incorrectly predicted that a network health status rule was violated. Conversely, the false negatives of the model may refer to the number of times the model predicted that a health status rule was not violated when, in fact, the rule was violated. True positives and true negatives may refer to the number of times the model correctly predicted that a rule was violated or not violated, respectively. Related to these measurements are the concepts of recall and precision. Generally, recall refers to the ratio of true positives to the sum of true positives and false negatives, which quantifies the sensitivity of the model. Similarly, precision refers to the ratio of true positives to the sum of true and false positives.
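The definitions of recall and precision given above can be expressed as a short Python sketch. The function names and the convention that a value of True means "rule violated" are assumptions made for this example.

```python
from typing import Sequence, Tuple


def confusion_counts(predicted: Sequence[bool],
                     actual: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count (TP, FP, TN, FN), where True means 'rule violated'."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1          # predicted violation, rule was violated
        elif p and not a:
            fp += 1          # predicted violation, rule was not violated
        elif not p and a:
            fn += 1          # missed an actual violation
        else:
            tn += 1          # correctly predicted no violation
    return tp, fp, tn, fn


def recall(tp: int, fn: int) -> float:
    """Ratio of true positives to the sum of true positives and false negatives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Ratio of true positives to the sum of true and false positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

Returning 0.0 when a denominator is zero is a design choice of this sketch; other conventions (e.g., treating the value as undefined) are equally possible.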

FIG. 3 illustrates an example network assurance system 300, according to various embodiments. As shown, at the core of network assurance system 300 may be a cloud service 302 that leverages machine learning in support of cognitive analytics for the network, predictive analytics (e.g., models used to predict user experience, etc.), troubleshooting with root cause analysis, and/or trending analysis for capacity planning. Generally, system 300 may support both wireless and wired networks, as well as LLNs/IoT networks.

In various embodiments, cloud service 302 may oversee the operations of the network of an entity (e.g., a company, school, etc.) that includes any number of local networks. For example, cloud service 302 may oversee the operations of the local networks of any number of branch offices (e.g., branch office 306) and/or campuses (e.g., campus 308) that may be associated with the entity. Data collection from the various local networks/locations may be performed by a network data collection platform 304 that communicates with both cloud service 302 and the monitored network of the entity.

The network of branch office 306 may include any number of wireless access points 320 (e.g., a first access point, AP1, through an nth access point, APn) through which endpoint nodes may connect. Access points 320 may, in turn, be in communication with any number of wireless LAN controllers (WLCs) 326 (e.g., supervisory devices that provide control over APs) located in a centralized datacenter 324. For example, access points 320 may communicate with WLCs 326 via a VPN 322 and network data collection platform 304 may, in turn, communicate with the devices in datacenter 324 to retrieve the corresponding network feature data from access points 320, WLCs 326, etc. In such a centralized model, access points 320 may be flexible access points and WLCs 326 may be N+1 high availability (HA) WLCs, by way of example.

Conversely, the local network of campus 308 may instead use any number of access points 328 (e.g., a first access point, AP1, through an mth access point, APm) that provide connectivity to endpoint nodes, in a decentralized manner. Notably, instead of maintaining a centralized datacenter, access points 328 may instead be connected to distributed WLCs 330 and switches/routers 332. For example, WLCs 330 may be 1:1 HA WLCs and access points 328 may be local mode access points, in some implementations.

To support the operations of the network, there may be any number of network services and control plane functions 310. For example, functions 310 may include routing topology and network metric collection functions such as, but not limited to, routing protocol exchanges, path computations, monitoring services (e.g., NetFlow or IPFIX exporters), etc. Further examples of functions 310 may include authentication functions, such as by an Identity Services Engine (ISE) or the like, mobility functions such as by a Connected Mobile Experiences (CMX) function or the like, management functions, and/or automation and control functions such as by an APIC-Enterprise Manager (APIC-EM).

During operation, network data collection platform 304 may receive a variety of data feeds that convey collected data 334 from the devices of branch office 306 and campus 308, as well as from network services and network control plane functions 310. Example data feeds may comprise, but are not limited to, management information bases (MIBs) with Simple Network Management Protocol (SNMP)v2, JavaScript Object Notation (JSON) files (e.g., WSA wireless, etc.), NetFlow/IPFIX records, and log reporting, in order to collect rich datasets related to network control planes (e.g., Wi-Fi roaming, join and authentication, routing, QoS, PHY/MAC counters, links/node failures), traffic characteristics, and other such telemetry data regarding the monitored network. As would be appreciated, network data collection platform 304 may receive collected data 334 on a push and/or pull basis, as desired. Network data collection platform 304 may prepare and store the collected data 334 for processing by cloud service 302. In some cases, network data collection platform 304 may also anonymize collected data 334 before providing the anonymized data 336 to cloud service 302.

In some cases, cloud service 302 may include a data mapper and normalizer 314 that receives the collected and/or anonymized data 336 from network data collection platform 304. In turn, data mapper and normalizer 314 may map and normalize the received data into a unified data model for further processing by cloud service 302. For example, data mapper and normalizer 314 may extract certain data features from data 336 for input and analysis by cloud service 302.
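One way to picture the mapping and normalization step is the following Python sketch, which renames source-specific fields to a common schema and then rescales a numeric feature. Everything in it is hypothetical: the field map, the unified field names ("device", "rx_bytes"), the dictionary-based record format, and the choice of min-max scaling are assumptions for illustration and do not describe the actual data model of data mapper and normalizer 314.

```python
from typing import Dict, List

# Hypothetical per-source field maps: native field name -> unified field name.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "snmp": {"sysName": "device", "ifInOctets": "rx_bytes"},
    "netflow": {"exporter": "device", "in_bytes": "rx_bytes"},
}


def map_record(source: str, raw: Dict[str, object]) -> Dict[str, object]:
    """Rename the fields of one raw record into the unified schema."""
    mapping = FIELD_MAP[source]
    return {unified: raw[native] for native, unified in mapping.items()}


def min_max_normalize(records: List[Dict[str, object]],
                      field: str) -> List[Dict[str, object]]:
    """Rescale `field` to the range [0, 1] across all records."""
    if not records:
        return []
    values = [float(r[field]) for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**r, field: (float(r[field]) - lo) / span if span else 0.0}
        for r in records
    ]


records = [
    map_record("snmp", {"sysName": "ap1", "ifInOctets": 100}),
    map_record("netflow", {"exporter": "wlc1", "in_bytes": 300}),
]
normalized = min_max_normalize(records, "rx_bytes")
```

After mapping, records from both feeds share the same keys, so downstream analysis can treat them uniformly regardless of the originating feed.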

In various embodiments, cloud service 302 may include a machine learning (ML)-based analyzer 312 configured to analyze the mapped and normalized data from data mapper and normalizer 314. Generally, analyzer 312 may comprise a powerful machine learning-based engine that is able to understand the dynamics of the monitored network, as well as to predict behaviors and user experiences, thereby allowing cloud service 302 to identify and remediate potential network issues before they happen.

Machine learning-based analyzer 312 may include any number of machine learning models to perform the techniques herein, such as for cognitive analytics, predictive analysis, and/or trending analytics as follows:

- Cognitive Analytics Model(s): The aim of cognitive analytics is to find behavioral patterns in complex and unstructured datasets. For the sake of illustration, analyzer 312 may be able to extract patterns of Wi-Fi roaming in the network and roaming behaviors (e.g., the “stickiness” of clients to APs 320, 328, “ping-pong” clients, the number of visited APs 320, 328, roaming triggers, etc.). Analyzer 312 may characterize such patterns by the nature of the device (e.g., device type, OS) according to the place in the network, time of day, routing topology, type of AP/WLC, etc., and potentially correlated with other network metrics (e.g., application, QoS, etc.). In another example, the cognitive analytics model(s) may be configured to extract AP/WLC related patterns such as the number of clients, traffic throughput as a function of time, number of roaming processed, or the like, or even end-device related patterns (e.g., roaming patterns of iPhones, IoT Healthcare devices, etc.).

- Predictive Analytics Model(s): These model(s) may be configured to predict user experiences, which is a significant paradigm shift from reactive approaches to network health. For example, in a Wi-Fi network, analyzer 312 may be configured to build predictive models for the joining/roaming time by taking into account a large plurality of parameters/observations (e.g., RF variables, time of day, number of clients, traffic load, DHCP/DNS/Radius time, AP/WLC loads, etc.). From this, analyzer 312 can detect potential network issues before they happen. Furthermore, should abnormal joining time be predicted by analyzer 312, cloud service 302 will be able to identify the major root cause of this predicted condition, thus allowing cloud service 302 to remedy the situation before it occurs. The predictive analytics model(s) of analyzer 312 may also be able to predict other metrics such as the expected throughput for a client using a specific application. In yet another example, the predictive analytics model(s) may predict the user experience for voice/video quality using network variables (e.g., a predicted user rating of 1-5 stars for a given session, etc.), as a function of the network state. As would be appreciated, this approach may be far superior to traditional approaches that rely on a mean opinion score (MOS). In contrast, cloud service 302 may use the predicted user experiences from analyzer 312 to provide information to a network administrator or architect in real-time and enable closed loop control over the network by cloud service 302, accordingly. For example, cloud service 302 may signal to a particular type of endpoint node in branch office 306 or campus 308 (e.g., an iPhone, an IoT healthcare device, etc.) that better QoS will be achieved if the device switches to a different AP 320 or 328.

- Trending Analytics Model(s): The trending analytics model(s) may include multivariate models that can predict future states of the network, thus separating noise from actual network trends. Such predictions can be used, for example, for purposes of capacity planning and other “what-if” scenarios.

Machine learning-based analyzer 312 may be specifically tailored for use cases in which machine learning is the only viable approach due to the high dimensionality of the dataset and in which patterns cannot otherwise be understood and learned. For example, finding a pattern so as to predict the actual user experience of a video call, while taking into account the nature of the application, video CODEC parameters, the states of the network (e.g., data rate, RF, etc.), the current observed load on the network, destination being reached, etc., is simply impossible using predefined rules in a rule-based system.

Unfortunately, there is no one-size-fits-all machine learning methodology that is capable of solving all, or even most, use cases. In the field of machine learning, this is referred to as the “No Free Lunch” theorem. Accordingly, analyzer 312 may rely on a set of machine learning processes that work in conjunction with one another and, when assembled, operate as a multi-layered kernel. This allows network assurance system 300 to operate in real-time and constantly learn and adapt to new network conditions and traffic characteristics. In other words, not only can system 300 compute complex patterns in highly dimensional spaces for prediction or behavioral analysis, but system 300 may constantly evolve according to the captured data/observations from the network.

Cloud service 302 may also include output and visualization interface 318 configured to provide sensory data to a network administrator or other user via one or more user interface devices (e.g., an electronic display, a keypad, a speaker, etc.). For example, interface 318 may present data indicative of the state of the monitored network, current or predicted issues in the network (e.g., the violation of a defined rule, etc.), insights or suggestions regarding a given condition or issue in the network, etc. Cloud service 302 may also receive input parameters from the user via interface 318 that control the operation of system 300 and/or the monitored network itself. For example, interface 318 may receive an instruction or other indication from the user to adjust/retrain one of the models of analyzer 312 (e.g., the user deems an alert/rule violation as a false positive).

In various embodiments, cloud service 302 may further include an automation and feedback controller 316 that provides closed-loop control instructions 338 back to the various devices in the monitored network. For example, based on the predictions by analyzer 312, the evaluation of any predefined health status rules by cloud service 302, and/or input from an administrator or other user via interface 318, controller 316 may instruct an endpoint client device, networking device in branch office 306 or campus 308, or a network service or control plane function 310, to adjust its operations (e.g., by signaling an endpoint to use a particular AP 320 or 328, etc.).

As noted above, a network assurance system/service can leverage machine learning-based anomaly detection to detect behavioral anomalies in a monitored network, such as a wireless network. Such a system may use outlier detection to flag anomalies (rare events), either by leveraging statistical techniques or by computing predicted ranges using percentile regression. In most cases, a second machine learning layer may also be used for identifying the root causes of anomalies using common trait analysis, cross signal correlation, and/or predefined rules, with closed loop control. However, finding root causes is challenging: there is no one-size-fits-all approach, and a collection of approaches, such as the ones listed above, is used in combination. Notably, one of the fundamental challenges of all machine learning-based anomaly detection is to provide the models with the proper measurements/key performance indicators (KPIs). If the proper KPI is not provided to the model as input, the root cause of a detected anomaly simply cannot be determined.

Dynamic Inspection of Networking Dependencies to Enhance Anomaly Detection Models in a Network Assurance Service

The techniques herein introduce a mechanism that allows for the adjustment of the feature set of inputs to a machine learning model used to assess the operations of a network, based on the networking dependencies in the network. In some aspects, analysis of the network topology can be used to augment the model with additional features/measurements, to continuously improve the efficacy of the system and the ability of the system to identify the root causes of anomalies. In some implementations, data from an NMS or from the networking devices themselves can be correlated with the raised anomalies, to determine if one of these devices could be the root cause and should be added to the input feature set of the anomaly detection mechanism.

Specifically, according to one or more embodiments of the disclosure as described in detail below, a network assurance service that monitors a network detects, using a machine learning-based anomaly detector, network anomalies associated with source nodes in the monitored network. The network assurance service identifies, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node. The network assurance service correlates networking devices along the network paths in the identified sets of network paths with the detected network anomalies. The network assurance service adjusts the machine learning-based anomaly detector to use a performance measurement for a particular one of the networking devices as an input feature, based on the correlation between the particular networking device and the detected network anomalies.

Illustratively, the techniques described herein may be performed by hardware, software, and/or firmware, such as in accordance with the network assurance process 248, which may include computer executable instructions executed by the processor 220 (or independent processor of interfaces 210) to perform functions relating to the techniques described herein.

Operationally, FIG. 4 illustrates an example architecture 400 for performing the dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service, according to various embodiments. At the core of architecture 400 may be the following components: one or more anomaly detectors 406, a network dependency analyzer 408, a feature adjuster 410, and/or a feedback collection module (FCM) 412. In some implementations, the components 406-412 of architecture 400 may be implemented within a network assurance system, such as system 300 shown in FIG. 3. Accordingly, the components 406-412 of architecture 400 shown may be implemented as part of cloud service 302 (e.g., as part of machine learning-based analyzer 312 and/or output and visualization interface 318), as part of network data collection platform 304, and/or on one or more network elements/entities 404 that communicate with one or more client devices 402 within the monitored network itself. Further, these components 406-412 may be implemented in a distributed manner or implemented as their own stand-alone service, either as part of the local network under observation or as a remote service. In addition, the functionalities of the components of architecture 400 may be combined, omitted, or implemented as part of other processes, as desired.

During operation, service 302 may receive telemetry data from the monitored network (e.g., anonymized data 336 and/or data 334) and, in turn, assess the data using one or more anomaly detectors 406. At the core of each anomaly detector 406 may be a corresponding anomaly detection model, such as an unsupervised learning-based model. When an anomaly detector 406 detects a network anomaly, output and visualization interface 318 may send an anomaly detection alert to a user interface (UI) for review by a subject matter expert (SME), network administrator, or other user. Notably, an anomaly detector 406 may assess any number of different network behaviors captured by the telemetry data (e.g., number of wireless onboarding failures, onboarding times, DHCP failures, etc.) and, if the observed behavior differs from the modeled behavior by a threshold amount, the anomaly detector 406 may report the anomaly to the user interface via output and visualization interface 318.

According to various embodiments, architecture 400 may also include feedback collection module (FCM) 412, such as part of output and visualization interface 318 or other element of architecture 400. During operation, FCM 412 is responsible for collecting feedback on different alerts raised by service 302. In a simple embodiment, FCM 412 may include a combination of UI elements provided to the UI (e.g., a display, etc.), application programming interfaces (APIs), and/or databases that allow rankers to provide explicit feedback on the different alerts raised by service 302. This feedback is typically in the form of like/dislike cues and is explicitly associated with a given root cause. In another embodiment, FCM 412 may allow for feedback in the form of free-form text input from the UI and leverage Natural Language Understanding and Sentiment Analysis to assign similar scores to underlying root causes. Such an embodiment makes the process more natural to the user, but at the expense of a level of indirection that must be accounted for when exploiting this feedback.

In a further embodiment, FCM 412 may collect feedback generated by a third party application/system in charge of exploiting the root cause proposed by the system. For example, automation and feedback controller 316 or another mechanism may use root cause information for purposes of remediation (e.g., by controlling or adjusting the monitored network) and, based on its effects, provide feedback to FCM 412. For example, if the root cause of an on-boarding issue relates to a specific device causing the trouble (e.g., client 402), such a mechanism could blacklist the “bad apple.” Thus, if the issue does not persist after the remediation action, the mechanism could provide automatic feedback, thus validating the root cause raised by the system in the first place.

In some embodiments, anomaly detector(s) 406 may also be configured to perform root cause analysis on any detected anomalies. For example, one anomaly detection model may assess a certain feature set (e.g., measurements) from the network, while another model works in conjunction with the first model to attempt to explain why the first model detected an anomaly. By way of example, consider the case in which one model of an anomaly detector 406 uses features/measurements such as throughput, packet loss, etc., while another model of the anomaly detector attempts to determine the root cause of the behavioral anomalies by assessing the wireless channel in use, the number of attached clients to an AP, etc.

Rather than simply use a static feature set of measurements that an anomaly detector 406 may use for purposes of detecting behavioral anomalies in the network and/or the root cause of such an anomaly, the techniques herein introduce a mechanism to dynamically adjust the assessed features based on the networking dependencies involved. To this end, service 302 may be configured to report detected behavioral anomalies via an application program interface (API). For example, such a reported anomaly may include any or all of the following information:

-   -   Device ID (e.g., MAC address, IP address, etc.)
    -   Device-type (e.g., Wireless controller, Wireless Access Point, Switch, etc.)
    -   Severity of the anomaly (as specified by the AD system)
    -   Time of the anomaly (with high accuracy using NTP)
    -   Etc.

The same API can also be used by another Network Management System (NMS) and/or in the form of a Simple Network Management Protocol (SNMP) trap. In various embodiments, the reported anomalies can be used to obtain topology information from the network that may be associated with the detected anomalies.
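As a minimal sketch of what such a reported anomaly might look like as a record, assuming a Python representation, the fields listed above can be captured as follows. The class name `AnomalyReport`, the helper `to_payload`, the `anomaly_type` field, and the sample values are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AnomalyReport:
    """Hypothetical record carrying the fields listed for a reported anomaly."""
    device_id: str       # e.g., MAC address or IP address
    device_type: str     # e.g., "wlc", "ap", "switch"
    severity: float      # as specified by the anomaly detection system
    timestamp: datetime  # time of the anomaly (assumed NTP-synchronized)
    anomaly_type: str    # assumed extra field, e.g., "onboarding"


def to_payload(report):
    """Convert the report to a plain dict suitable for an API response."""
    payload = asdict(report)
    payload["timestamp"] = report.timestamp.isoformat()
    return payload


# Illustrative values only.
report = AnomalyReport(
    device_id="10.0.0.12",
    device_type="ap",
    severity=0.8,
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    anomaly_type="onboarding",
)
payload = to_payload(report)
```

The device ID in such a record is what a dependency analysis could use as the source S of an S-D pair.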

In various embodiments, service 302 may include network dependency analyzer 408 that is configured to assess the networking dependencies of the networking devices (e.g., network entities 404) potentially involved in a behavioral anomaly detected in the monitored network. For each type of issue/anomaly, there may be one specific network path in the form of S-D tuples, where S is the ID of the networking device/entity identified in the above API and associated with the detected anomaly, whereas D is the potential destination of traffic associated with S.

By way of example, consider the case in which an anomaly detector 406 detects an anomaly in the onboarding times of wireless clients 402 to the monitored wireless network. In such a case, the behavioral anomalies may be associated with a particular AP (e.g., an entity 404 or even a client 402), the ID for which can be represented as S. In such a case, networking dependency analyzer 408 may work in conjunction with anomaly detector 406 to identify the potential networking dependencies associated with this anomaly. Notably, in the case of wireless clients onboarding onto a wireless network, the destination of the traffic may be a DHCP server or AAA server located at a different location from that of the onboarding client (e.g., onboarding of a client in local network 160 in FIG. 1B may entail leveraging a server located in data center/cloud 150). In such a case, D may represent the list of potential destinations involved in the onboarding. In another example related to throughput, D could be an exit gateway to the Internet, or a list of servers, if the Wireless SSID points to a server in the Intranet.

During operation, networking dependency analyzer 408 may perform any or all of the following:

-   -   For each S-D pair, resolve the address of all potential destinations. If D is a server type (e.g., DHCP, AAA, etc.), find all (DHCP/AAA/ . . . ) server addresses that might have been involved in the anomaly. Along with the server addresses, there may be other information that logically separates the network paths; these also would be considered for the purpose of this invention. For example, the VLAN ID would hit a different DHCP pool resource within the same DHCP server. Hence, in this case, the VLAN ID would be another attribute used along with the DHCP IP address. This is particularly of interest in a fabric architecture.
    -   Retrieve routing topologies: compute all paths P₁, . . . , P_(n) involved between S and D(s). In one embodiment, analyzer 408 may do so via a routing lookup. Indeed, analyzer 408 may interface with the routing domain (e.g., as an ISIS node with the overload bit set) and compute a shortest path first (SPF) to find all components listed in the paths. Note that in the case of load balancing along the paths, alternate routers along those paths may have their Routing Information Base and/or Forwarding Information Base (RIB/FIB) inspected. In another embodiment, analyzer 408 may instead cause path trace probes (traceroutes) to be sent, potentially also with the proper extensions to handle load balancing using MLPP and other approaches.
    -   Construct a dependency graph (DG) where the root is the source S and the leaves are all potential Ds, which includes all potential paths. In turn, analyzer 408 may add each element/networking device E=<E₁, . . . ,E_(n)> of the set of paths to a list L_(i) for each anomaly i. Here, the Type of Service field or application type may also be considered for traceroutes. In some anomalies where there are different applications and types of service involved (e.g., radio throughput may include multiple applications with different types of service), analyzer 408 may also consider the individual weighted path.
    -   Correlate networking devices/elements with anomalies. In one embodiment, networking dependency analyzer 408 may, for all network elements listed in L, search for temporal correlations between the nodes in the dependency graphs and the anomalies detected by anomaly detector(s) 406. For example, in the case of on-boarding time anomalies, analyzer 408 might find that there is one common leaf between K dependency graphs (for K anomalies) that have failed at the same time. Let LC be the list of common elements found in multiple dependency trees for which analyzer 408 found time-correlation between failures. Note that the time correlation between failures of elements of LC may be triggered by automatic inspection of their respective logs. In this case, analyzer 408 may remotely connect to each of these elements, retrieve their respective logs, and find time correlation between failures, before checking if the original anomaly took place at the same time. In another example, networking dependency analyzer 408 may find that a given link in the network is shared between paths followed by the traffic generated by two wireless APs, and may find some correlation between the link failure (or high congestion state) and throughput anomalies experienced by the traffic originated by the two APs.
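The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: paths are taken as already-resolved hop lists from S to each D, device failure times are assumed to have been extracted from logs as numeric timestamps, and all function names, device names, and the correlation window are hypothetical.

```python
def build_dependency_graph(paths):
    """Merge hop-by-hop paths (each starting at S) into an adjacency map."""
    graph = {}
    for path in paths:
        for parent, child in zip(path, path[1:]):
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set())
    return graph


def elements_on_paths(paths):
    """List L_i for one anomaly: every element on any path, excluding S."""
    return {hop for path in paths for hop in path[1:]}


def time_correlated(elements, failure_times, anomaly_time, window):
    """Elements with a logged failure within `window` seconds of the anomaly."""
    return {
        element
        for element in elements
        if any(abs(t - anomaly_time) <= window
               for t in failure_times.get(element, []))
    }


# Illustrative data: one AP (source S) and two onboarding destinations.
paths = [
    ["ap1", "sw1", "r1", "dhcp"],
    ["ap1", "sw1", "r1", "aaa"],
]
graph = build_dependency_graph(paths)
elements = elements_on_paths(paths)

# Hypothetical failure timestamps (seconds) extracted from device logs.
failure_times = {"dhcp": [1000.0], "r1": [50.0]}
suspects = time_correlated(elements, failure_times,
                           anomaly_time=1010.0, window=30.0)
```

In this example only the DHCP server has a failure close in time to the anomaly, so it is the only element retained as a suspect.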

FIG. 5 illustrates an example dependency graph 500 that networking dependency analyzer 408 may construct for a network behavioral anomaly detected by anomaly detector 406. In the case shown, assume that the anomaly is related to an onboarding anomaly (e.g., an anomalous onboarding time, etc.) associated with a source S, represented by root node 502 and its corresponding IP address, IP1. Since the anomaly is related to onboarding, there may be a set of potential destinations of traffic for S, which may be represented in graph 500 as nodes 504 and 506. Indeed, the set of destinations D may comprise the IP addresses of a first DHCP server and an AAA server that may be involved in the onboarding. Note that the actual destination may not be readily available from the detected onboarding anomaly, but may be identified by the anomaly type or known clients or devices associated with the anomaly (e.g., a particular AP, etc.).

Using the S-D pair, networking dependency analyzer 408 may obtain the network path information for the path(s) between S and the servers in D. In turn, analyzer 408 may represent the identified networking devices or other entities, such as tunnels or public networks, as their own nodes in graph 500. For example, as shown, graph 500 may also include node 508, a first switch, and node 510, a first router, that are part of the local network of S. Also as shown, the router represented by node 510 may have dual connections to a second router, represented by node 516, via an MPLS TE tunnel (node 512) and via the public Internet (node 514). In turn, the second router, such as a router of a data center, may be connected to a second switch, represented by node 516, which provides connectivity to the two servers in D.
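To make the structure of such a graph concrete, the following sketch encodes the topology described for graph 500 as an adjacency map and enumerates every path from S to the destinations in D. The node labels are descriptive names chosen for this example rather than the reference numerals of FIG. 5, and the function name `all_paths` is hypothetical.

```python
# Adjacency map mirroring the described topology; labels are illustrative.
FIG5_GRAPH = {
    "S": ["switch1"],
    "switch1": ["router1"],
    "router1": ["mpls_te_tunnel", "internet"],  # dual connections
    "mpls_te_tunnel": ["router2"],
    "internet": ["router2"],
    "router2": ["switch2"],
    "switch2": ["dhcp", "aaa"],
}


def all_paths(graph, node, destinations, prefix=()):
    """Yield every loop-free path from `node` to any destination (depth-first)."""
    path = prefix + (node,)
    if node in destinations:
        yield path
        return
    for neighbor in graph.get(node, ()):
        if neighbor not in path:  # guard against cycles
            yield from all_paths(graph, neighbor, destinations, path)


paths = sorted(all_paths(FIG5_GRAPH, "S", {"dhcp", "aaa"}))
```

Because the first router reaches the second router in two ways and there are two servers, the enumeration yields four distinct paths, each of which contributes its elements to the list for the anomaly.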

FIG. 6 illustrates a second example of a dependency graph 600, in further embodiments. Similar to graph 500, graph 600 may include a source S, represented by node 602, and a list of potential destinations D, represented by nodes 604-606, which networking dependency analyzer 408 may identify from an anomaly detected by anomaly detector 406. In turn, analyzer 408 may obtain the path information, such as from an NMS, RIB/FIB information of the devices, using a traceroute, or the like. Using this topology information, analyzer 408 may represent the identified networking devices along the path(s) between S and D as nodes 608-618, to represent the various switches, routers, links, and tunnels that may separate S from its potential destination(s) in D.

FIG. 7 illustrates an example plot 700 of network throughput for a network monitored during testing of the techniques herein. Plot 700 illustrates two values over time: 1.) the measured throughput 702 in the network and 2.) the prediction range 704 for the throughput by the anomaly detector 406. For the majority of the time, the measured throughput 702 falls within the prediction range 704 and, thus, anomaly detector 406 may deem the throughput to be normal. However, at point 706 shown, the throughput suddenly drops below the prediction range 704 and, consequently, is flagged as an anomaly by anomaly detector 406. However, detection of the unexpected drop in throughput at point 706 does not actually explain why the throughput drops.
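The range check illustrated by plot 700 can be sketched as follows. Note that this sketch substitutes simple empirical percentiles of past measurements for the percentile regression mentioned earlier, so it is a simplified stand-in rather than the detector itself. The function names, the 5th/95th percentile bounds, and the sample throughput values are assumptions made for illustration.

```python
import statistics


def prediction_range(history, lower_pct=5, upper_pct=95):
    """Return (low, high) bounds from empirical percentiles of past values."""
    cuts = statistics.quantiles(history, n=100)  # 99 cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_anomalous(measured, low, high):
    """Flag a measurement that falls outside the prediction range."""
    return measured < low or measured > high


# Illustrative throughput history (arbitrary units).
history = list(range(90, 111))
low, high = prediction_range(history)

normal_flag = is_anomalous(100.0, low, high)  # inside the range
drop_flag = is_anomalous(40.0, low, high)     # sudden drop below the range
```

As with point 706, the flag only indicates that the drop is unexpected; it carries no information about which networking device caused it.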

For purposes of illustration of the techniques herein, assume that the drop in throughput at point 706 was similarly experienced around this time at six other sites in the network. For each of these throughput anomalies, network dependency analyzer 408 may identify the network paths between the sources associated with the anomalies and their potential destinations, to construct dependency graphs. In turn, network dependency analyzer 408 may perform temporal correlation between the dependency graphs, to identify any common networking devices. For example, assume that each of the dependency graphs includes the same WLC, indicating a strong likelihood that this WLC is the root cause of the throughput anomalies.
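A minimal sketch of this common-device search, assuming each dependency graph has already been reduced to the set of elements on its paths, is shown below. The function name `common_elements`, the `min_fraction` parameter, and the device names are hypothetical.

```python
from collections import Counter


def common_elements(element_sets, min_fraction=1.0):
    """Return elements appearing in at least `min_fraction` of the graphs."""
    counts = Counter(e for elements in element_sets for e in set(elements))
    needed = min_fraction * len(element_sets)
    return {element for element, count in counts.items() if count >= needed}


# Seven sites with time-correlated throughput anomalies; each site has its
# own switch and router, but all of them depend on the same WLC.
site_graphs = [{f"sw{i}", f"r{i}", "wlc1"} for i in range(7)]
shared = common_elements(site_graphs)
```

Lowering `min_fraction` would tolerate a device that is missing from a few graphs, which may be useful when path information is incomplete.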

Referring again to FIG. 4, another potential component of architecture 400 is feature adjuster 410, in some embodiments. During operation, feature adjuster 410 may identify the “missing” explanatory feature that should be used by the root causing model of anomaly detector(s) 406. Indeed, the subset of networking devices/elements for which a strong time-correlation has been found by networking dependency analyzer 408 can then be used by feature adjuster 410 to identify the missing KPI/input feature for the root cause model of anomaly detector 406. For example, in the case in which analyzer 408 determines that a particular WLC is highly correlated to throughput anomalies detected by an anomaly detector 406, feature adjuster 410 may add one or more measurements as input features to the root cause model of the detector 406, such as the CPU load of the WLC, memory usage of the WLC, or the like. By doing so, not only can service 302 identify when anomalous behavior occurs in the monitored network, but also provide an explanation as to why the anomaly occurred, as part of an alert sent by output and visualization interface 318. For example, such an alert may indicate that a throughput anomaly occurred and is likely due to a spike in the CPU usage by a particular WLC in the network.
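One way to picture the feature adjustment is as an update to the list of input feature names used by the root cause model. The sketch below assumes a hypothetical catalog `DEVICE_KPIS` that maps a device type to the measurements available for it; the catalog contents, the function name `adjust_features`, and the feature naming scheme are illustrative assumptions.

```python
# Hypothetical catalog of measurements available per device type.
DEVICE_KPIS = {
    "wlc": ["cpu_load", "memory_usage"],
    "switch": ["cpu_load"],
}


def adjust_features(features, correlated_devices, device_types, kpis=DEVICE_KPIS):
    """Return a new feature list extended with KPIs of correlated devices."""
    updated = list(features)
    for device in sorted(correlated_devices):
        device_type = device_types.get(device, "")
        for kpi in kpis.get(device_type, []):
            name = f"{device}.{kpi}"
            if name not in updated:  # keep the feature set free of duplicates
                updated.append(name)
    return updated


# Existing root cause features plus a WLC found to be highly correlated.
current = ["wireless_channel", "attached_clients"]
new_features = adjust_features(current, {"wlc1"}, {"wlc1": "wlc"})
```

Returning a new list rather than mutating the input keeps the previous feature set available, for example if the adjusted model needs to be compared against the original.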

FIG. 8 illustrates an example simplified procedure for dynamic inspection of networking dependencies to enhance anomaly detection models, in accordance with one or more embodiments described herein. For example, a non-generic, specifically configured device (e.g., device 200) may perform procedure 800 by executing stored instructions (e.g., process 248) to provide a network assurance service to a monitored network. The procedure 800 may start at step 805, and continues to step 810, where, as described in greater detail above, the service may detect, using a machine learning-based anomaly detector, network anomalies associated with source nodes in the monitored network. For example, the source nodes may be particular wireless APs, clients, or other devices in the network that are associated with the anomalies. The detected anomalies can also be of any number of different anomaly types. For example, the detector may be configured to detect throughput anomalies, onboarding anomalies, or the like, in the monitored network.

At step 815, as detailed above, the service may identify, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node. For example, in the case of onboarding anomalies, the potential destinations of the traffic could be DHCP or AAA servers. In some embodiments, the service may receive routing information base (RIB) or forwarding information base (FIB) information for the network paths. In further embodiments, the service may use traceroute information to form dependency graphs that connect the source nodes and the destinations.

At step 820, the service may correlate networking devices along the network paths in the identified sets of network paths with the detected network anomalies, as described in greater detail above. For example, in some cases, the service may correlate the detected anomalies with network device failure alarms from a network management system. In further embodiments, the service may represent the networking devices along the paths in dependency graphs and, in turn, perform temporal correlation to identify any networking devices that appear across the set of detected anomalies.

At step 825, as detailed above, the service may adjust the machine learning-based anomaly detector to use a performance measurement for a particular one of the networking devices as an input feature, based on the correlation between the particular networking device and the detected network anomalies. For example, if a particular AP, AP controller (e.g., WLC), switch, router, or the like, is highly correlated with the anomalies, the service may add one or more performance measurements for that device to the root cause model of the anomaly detector. By doing so, the detector can not only identify behavioral anomalies in the network, but the detector can also potentially identify the cause of the anomaly. Procedure 800 then ends at step 830.

It should be noted that while certain steps within procedure 800 may be optional as described above, the steps shown in FIG. 8 are merely examples for illustration, and certain other steps may be included or excluded as desired. Further, while a particular order of the steps is shown, this ordering is merely illustrative, and any suitable arrangement of the steps may be utilized without departing from the scope of the embodiments herein.

The techniques described herein, therefore, allow networking dependencies among networking devices in a monitored network to be leveraged for purposes of explaining behavioral anomalies in the network.

While there have been shown and described illustrative embodiments that provide for the dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the embodiments herein. For example, while certain embodiments are described herein with respect to using certain models for purposes of anomaly detection, the models are not limited as such and may be used for other functions, in other embodiments. In addition, while certain protocols are shown, such as BGP, other suitable protocols may be used, accordingly.

The foregoing description has been directed to specific embodiments. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the components and/or elements described herein can be implemented as software being stored on a tangible (non-transitory) computer-readable medium (e.g., disks/CDs/RAM/EEPROM/etc.) having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the embodiments herein. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the embodiments herein.

What is claimed is:
 1. A method comprising: detecting, by a networkassurance service that monitors a network and using a machinelearning-based anomaly detector, network anomalies associated withsource nodes in the monitored network; identifying, by the networkassurance service and for each of the detected anomalies, a set ofnetwork paths between the source nodes associated with the anomaly andone or more potential destinations of traffic for that source node;correlating, by the network assurance service, networking devices alongthe network paths in the identified sets of network paths with thedetected network anomalies; and adjusting, by the network assuranceservice, the machine learning-based anomaly detector to use aperformance measurement for a particular one of the networking devicesas an input feature, based on the correlation between the particularnetworking device and the detected network anomalies.
 2. The method asin claim 1, wherein the destinations comprise an authentication,authorization and accounting (AAA) server or a Dynamic HostConfiguration Protocol (DHCP) server.
 3. The method as in claim 1,wherein the particular networking device comprises a controller for oneor more wireless access points in the monitored network.
 4. The methodas in claim 1, wherein identifying, by the network assurance service andfor each of the detected anomalies, a set of network paths between thesource nodes associated with the anomaly and one or more potentialdestinations of traffic for that source node comprises: receivingrouting information base (RIB) or forwarding information base (FIB)information for the network paths.
 5. The method as in claim 1, whereincorrelating the networking devices along the network paths in theidentified sets of network paths with the detected network anomaliescomprises: correlating the detected anomalies with network devicefailure alarms from a network management system.
 6. The method as inclaim 1, wherein identifying, by the network assurance service and foreach of the detected anomalies, a set of network paths between thesource nodes associated with the anomaly and one or more potentialdestinations of traffic for that source node comprises: formingdependency graphs that connect the source nodes and the destinations,wherein networking devices along the paths are represented in the graphsas nodes.
 7. The method as in claim 6, wherein correlating thenetworking devices along the network paths in the identified sets ofnetwork paths with the detected network anomalies comprises: temporallycorrelating the dependency graphs to the detected anomalies; andidentifying the particular networking device as a potential cause of atleast a portion of the detected anomalies, based on the particularnetworking device appearing in a number of the dependency graphs thatare temporally correlated to the detected anomalies.
 8. An apparatus,comprising: one or more network interfaces to communicate with anetwork; a processor coupled to the network interfaces and configured toexecute one or more processes; and a memory configured to store aprocess executable by the processor, the process when executedconfigured to: detect, using a machine learning-based anomaly detector,network anomalies associated with source nodes in a monitored network;identify, for each of the detected anomalies, a set of network pathsbetween the source nodes associated with the anomaly and one or morepotential destinations of traffic for that source node; correlatenetworking devices along the network paths in the identified sets ofnetwork paths with the detected network anomalies; and adjust themachine learning-based anomaly detector to use a performance measurementfor a particular one of the networking devices as an input feature,based on the correlation between the particular networking device andthe detected network anomalies.
 9. The apparatus as in claim 8, whereinthe destinations comprise an authentication, authorization andaccounting (AAA) server or a Dynamic Host Configuration Protocol (DHCP)server.
 10. The apparatus as in claim 8, wherein the particular networking device comprises a controller for one or more wireless access points in the monitored network.
 11. The apparatus as in claim 8, wherein the particular networking device comprises a controller for one or more wireless access points in the monitored network.
 12. The apparatus as in claim 8, wherein the apparatus identifies, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node by: forming dependency graphs that connect the source nodes and the destinations, wherein networking devices along the paths are represented in the graphs as nodes.
 13. The apparatus as in claim 12, wherein the apparatus correlates the networking devices along the network paths in the identified sets of network paths with the detected network anomalies by: temporally correlating the dependency graphs to the detected anomalies; and identifying the particular networking device as a potential cause of at least a portion of the detected anomalies, based on the particular networking device appearing in a number of the dependency graphs that are temporally correlated to the detected anomalies.
 14. The apparatus as in claim 8, wherein the apparatus identifies, for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node by: receiving routing information base (RIB) or forwarding information base (FIB) information for the network paths.
 15. The apparatus as in claim 9, wherein the apparatus correlates the networking devices along the network paths in the identified sets of network paths with the detected network anomalies by: correlating the detected anomalies with network device failure alarms from a network management system.
 16. A tangible, non-transitory, computer-readable medium storing program instructions that cause a network assurance service to execute a process comprising: detecting, by the network assurance service and using a machine learning-based anomaly detector, network anomalies associated with source nodes in the monitored network; identifying, by the network assurance service and for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node; correlating, by the network assurance service, networking devices along the network paths in the identified sets of network paths with the detected network anomalies; and adjusting, by the network assurance service, the machine learning-based anomaly detector to use a performance measurement for a particular one of the networking devices as an input feature, based on the correlation between the particular networking device and the detected network anomalies.
 17. The computer-readable medium as in claim 16, wherein the destinations comprise an authentication, authorization and accounting (AAA) server or a Dynamic Host Configuration Protocol (DHCP) server.
 18. The computer-readable medium as in claim 16, wherein the particular networking device comprises a controller for one or more wireless access points in the monitored network.
 19. The computer-readable medium as in claim 16, wherein identifying, by the network assurance service and for each of the detected anomalies, a set of network paths between the source nodes associated with the anomaly and one or more potential destinations of traffic for that source node comprises: forming dependency graphs that connect the source nodes and the destinations, wherein networking devices along the paths are represented in the graphs as nodes.
 20. The computer-readable medium as in claim 19, wherein correlating the networking devices along the network paths in the identified sets of network paths with the detected network anomalies comprises: temporally correlating the dependency graphs to the detected anomalies; and identifying the particular networking device as a potential cause of at least a portion of the detected anomalies, based on the particular networking device appearing in a number of the dependency graphs that are temporally correlated to the detected anomalies.
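
By way of illustration only, the dependency-graph correlation recited in claims 6-7, 12-13 and 19-20 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Anomaly`, `DependencyGraph`, `rank_devices` and `select_feature_device`, the 60-second correlation window, the 0.5 share threshold, and all device names in the sample data are hypothetical and do not appear in the claims. Each dependency graph is simplified here to the ordered tuple of networking devices on one path from a source node to a destination.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    source: str        # source node associated with the detected anomaly
    timestamp: float   # detection time, in seconds


@dataclass(frozen=True)
class DependencyGraph:
    source: str        # source node of the path
    destination: str   # potential destination, e.g. an AAA or DHCP server
    devices: tuple     # networking devices along the path (graph nodes)
    timestamp: float   # time at which the path was observed, in seconds


def rank_devices(anomalies, graphs, window=60.0):
    """Count, per device, the anomalies whose temporally correlated graphs contain it."""
    counts = Counter()
    for anomaly in anomalies:
        seen = set()
        for graph in graphs:
            same_source = graph.source == anomaly.source
            close_in_time = abs(graph.timestamp - anomaly.timestamp) <= window
            if same_source and close_in_time:
                seen.update(graph.devices)
        # Each device is counted at most once per anomaly.
        counts.update(seen)
    # Highest count first; ties broken by device name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def select_feature_device(anomalies, graphs, window=60.0, min_share=0.5):
    """Return the top device if it is tied to at least min_share of the anomalies."""
    ranked = rank_devices(anomalies, graphs, window)
    if not ranked or not anomalies:
        return None
    device, count = ranked[0]
    return device if count / len(anomalies) >= min_share else None


# Hypothetical sample data: three anomalous clients sharing one controller.
anomalies = [
    Anomaly("client1", 100.0),
    Anomaly("client2", 105.0),
    Anomaly("client3", 400.0),
]
graphs = [
    DependencyGraph("client1", "dhcp", ("ap1", "sw1", "wlc1"), 95.0),
    DependencyGraph("client2", "aaa", ("ap2", "sw2", "wlc1"), 110.0),
    DependencyGraph("client3", "dhcp", ("ap3", "sw1", "wlc1"), 390.0),
    DependencyGraph("client3", "aaa", ("ap3", "sw3"), 1000.0),  # outside the window
]

ranked = rank_devices(anomalies, graphs)
candidate = select_feature_device(anomalies, graphs)
```

In this sketch, the device returned by `select_feature_device` would be the one whose performance measurement is considered as an additional input feature for the anomaly detector. How that measurement is collected and how the detector is adjusted are outside the scope of the example.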